{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cbd5a364",
   "metadata": {},
   "source": [
    "# HF Checkpoints with LlamaIndex and LangChain\n",
    "\n",
    "This notebook demonstrates how to plug in a local llm from [HuggingFace Hub Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) and [all-MiniLM-L6-v2 embedding from Huggingface](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), bind these to into [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/) with these customizations.\n",
    "\n",
    "The custom plug-ins shown in this notebook can be replaced, for example, you can swap out the [HuggingFace Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) with [HuggingFace checkpoint from Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1).\n",
    "\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    \n",
    "⚠️ The notebook before this one, `08_Option(1)_llama_index_with_NVIDIA_AI_endpoint.ipynb`, contains the same exercise as this notebook but uses NVIDIA AI Catelog's models via API calls instead of loading the models' checkpoints pulled from huggingface model hub, and then load from host to devices (i.e GPUs).\n",
    "\n",
    "Noted that, since we will load the checkpoints, it will be significantly slower to go through this entire notebook. \n",
    "\n",
    "If you do decide to go through this notebook, please kindly check the **Prerequisite** section below.\n",
    "\n",
    "There are continous development and retrieval techniques supported in LlamaIndex and this notebook just shows how to quickly replace components such as llm and embedding per user's choice, read more [documentation on llama-index](https://docs.llamaindex.ai/en/stable/) for the latest nformation. \n",
    "\n",
    "</div>\n",
    "\n",
    "### Prerequisite \n",
    "In order to successfully run this notebook, you will need the following -\n",
    "\n",
    "1. Already being approved of using the checkpoints via applying for [meta-llama](https://huggingface.co/meta-llama)\n",
    "2. At least 2 NVIDIA GPUs, each with at least 32G mem, preferably using Ampere architecture\n",
    "3. docker and [nvidia-docker](https://github.com/NVIDIA/nvidia-container-toolkit) installed \n",
    "4. Registered [NVIDIA NGC](https://www.nvidia.com/en-us/gpu-cloud/) and can pull and run NGC pytorch containers\n",
    "5. install necesary python dependencies : \n",
    "Note: if you are using the [Dockerfile.gpu_notebook](https://raw.githubusercontent.com/NVIDIA/GenerativeAIExamples/main/notebooks/Dockerfile.gpu_notebook), it should already prepare the environment for you. Otherwise please refer to the Dockerfile for environment building.\n",
    "\n",
    "In this notebook, we will cover the following custom plug-in components -\n",
    "\n",
    "    - LLM locally load from [HuggingFace Hub Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) and warp this into llama-index \n",
    "    \n",
    "    - A [HuggingFace embedding all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) \n",
    "    \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3786d093",
   "metadata": {},
   "source": [
    "### Step 1 - Load [HuggingFace Hub Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) \n",
    "\n",
    "\n",
    "Note: Scroll down and make sure you supply the **hf_token in code block below, replace [FILL_IN] with your huggingface token** \n",
    ", for how to generate the token from huggingface, please following instruction from [this link](https://huggingface.co/docs/transformers.js/guides/private)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e080a53",
   "metadata": {},
   "outputs": [],
   "source": [
    "## uncomment the below if you have not yet install the python dependencies\n",
    "#!pip install accelerate transformers==4.33.1 --upgrade"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8fcd1582",
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "import sys\n",
    "\n",
    "logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n",
    "logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))\n",
    "import os\n",
    "from IPython.display import Markdown, display\n",
    "from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer\n",
    "import torch\n",
    "\n",
    "def load_hf_model(model_name_or_path, device, num_gpus,hf_auth_token, debug=False):\n",
    "    \"\"\"Load an HF locally saved checkpoint.\"\"\"\n",
    "    if device == \"cpu\":\n",
    "        kwargs = {}\n",
    "    elif device == \"cuda\":\n",
    "        kwargs = {\"torch_dtype\": torch.float16}\n",
    "        if num_gpus == \"auto\":\n",
    "            kwargs[\"device_map\"] = \"auto\"\n",
    "        else:\n",
    "            num_gpus = int(num_gpus)\n",
    "            if num_gpus != 1:\n",
    "                kwargs.update(\n",
    "                    {\n",
    "                        \"device_map\": \"auto\",\n",
    "                        \"max_memory\": {i: \"13GiB\" for i in range(num_gpus)},\n",
    "                    }\n",
    "                )\n",
    "    elif device == \"mps\":\n",
    "        kwargs = {\"torch_dtype\": torch.float16}\n",
    "        # Avoid bugs in mps backend by not using in-place operations.\n",
    "        print(\"mps not supported\")\n",
    "    else:\n",
    "        raise ValueError(f\"Invalid device: {device}\")\n",
    "\n",
    "    if hf_auth_token is None:\n",
    "        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)\n",
    "        model = AutoModelForCausalLM.from_pretrained(\n",
    "            model_name_or_path, low_cpu_mem_usage=True, **kwargs\n",
    "        )\n",
    "    else:\n",
    "        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_auth_token=hf_auth_token, use_fast=False)\n",
    "        model = AutoModelForCausalLM.from_pretrained(\n",
    "            model_name_or_path, low_cpu_mem_usage=True,use_auth_token=hf_auth_token, **kwargs\n",
    "        )\n",
    "\n",
    "    if device == \"cuda\" and num_gpus == 1:\n",
    "        model.to(device)\n",
    "\n",
    "    if debug:\n",
    "        print(model)\n",
    "\n",
    "    return model, tokenizer\n",
    "\n",
    "\n",
    "\n",
    "# Define variable to hold llama2 weights naming\n",
    "model_name_or_path = \"meta-llama/Llama-2-13b-chat-hf\"\n",
    "# Set auth token variable from hugging face\n",
    "# Create tokenizer\n",
    "hf_token= \"[FILL_IN]\"\n",
    "device = \"cuda\"\n",
    "num_gpus = 2\n",
    "\n",
    "model, tokenizer = load_hf_model(model_name_or_path, device, num_gpus,hf_auth_token=hf_token, debug=False)\n",
    "# Setup a prompt\n",
    "prompt = \"### User:What is the fastest car in  \\\n",
    "          the world and how much does it cost? \\\n",
    "          ### Assistant:\"\n",
    "# Pass the prompt to the tokenizer\n",
    "inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
    "# Setup the text streamer\n",
    "streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10945486",
   "metadata": {},
   "source": [
    "run a test and see the model generating output response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3fd772cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "output = model.generate(**inputs, streamer=streamer, use_cache=True, max_new_tokens=100)\n",
    "# Covert the output tokens back to text\n",
    "output_text = tokenizer.decode(output[0], skip_special_tokens=True)\n",
    "output_text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8735d640",
   "metadata": {},
   "source": [
    "### Step 2 - Construct prompt template"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e53c0e45",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the prompt wrapper...but for llama index\n",
    "from llama_index.prompts.prompts import SimpleInputPrompt\n",
    "# Create a system prompt\n",
    "system_prompt = \"\"\"<<SYS>>\n",
    "You are a helpful, respectful and honest assistant. Always answer as\n",
    "helpfully as possible, while being safe. Your answers should not include\n",
    "any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.\n",
    "Please ensure that your responses are socially unbiased and positive in nature.\n",
    "\n",
    "If a question does not make any sense, or is not factually coherent, explain\n",
    "why instead of answering something not correct. If you don't know the answer\n",
    "to a question, please don't share false information.\n",
    "\n",
    "Your goal is to provide answers relating to the financial performance of\n",
    "the company.<</SYS>>[INST]\n",
    "\"\"\"\n",
    "# Throw together the query wrapper\n",
    "query_wrapper_prompt = SimpleInputPrompt(\"{query_str} [/INST]\")\n",
    "## do a test query\n",
    "query_str='What can you help me with?'\n",
    "query_wrapper_prompt.format(query_str=query_str)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58a675f3",
   "metadata": {},
   "source": [
    "### Step 3 - Load the chosen huggingface Embedding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f14a0a8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create and dl embeddings instance wrapping huggingface embedding into langchain embedding\n",
    "# Bring in embeddings wrapper\n",
    "from llama_index.embeddings import LangchainEmbedding\n",
    "# Bring in HF embeddings - need these to represent document chunks\n",
    "from langchain.embeddings.huggingface import HuggingFaceEmbeddings\n",
    "embeddings=LangchainEmbedding(\n",
    "    HuggingFaceEmbeddings(model_name=\"all-MiniLM-L6-v2\")\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2df926bc",
   "metadata": {},
   "source": [
    "### Step 4 - Prepare the locally loaded huggingface llm into into llamaindex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e541fb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the llama index HF Wrapper\n",
    "from llama_index.llms import HuggingFaceLLM\n",
    "# Create a HF LLM using the llama index wrapper\n",
    "llm = HuggingFaceLLM(context_window=4096,\n",
    "                    max_new_tokens=256,\n",
    "                    system_prompt=system_prompt,\n",
    "                    query_wrapper_prompt=query_wrapper_prompt,\n",
    "                    model=model,\n",
    "                    tokenizer=tokenizer)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd3275e7",
   "metadata": {},
   "source": [
    "### Step 5 - Wrap the custom embedding and the locally loaded huggingface llm into llama-index's ServiceContext"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1fa1aeeb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bring in stuff to change service context\n",
    "from llama_index import set_global_service_context\n",
    "from llama_index import ServiceContext"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6284cadd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create new service context instance\n",
    "service_context = ServiceContext.from_defaults(\n",
    "    chunk_size=1024,\n",
    "    llm=llm,\n",
    "    embed_model=embeddings\n",
    ")\n",
    "# And set the service context\n",
    "set_global_service_context(service_context)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88937615",
   "metadata": {},
   "source": [
    "### Step 6a - Load the text data using llama-index's SimpleDirectoryReader and we will be using the built-in [VectorStoreIndex](https://docs.llamaindex.ai/en/latest/community/integrations/vector_stores.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63d060f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "#create query engine with cross encoder reranker\n",
    "from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext\n",
    "import torch\n",
    "\n",
    "documents = SimpleDirectoryReader(\"./toy_data\").load_data()\n",
    "index = VectorStoreIndex.from_documents(documents, service_context=service_context)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe5db8f7",
   "metadata": {},
   "source": [
    "### Step 6b - This will serve as the query engine for us to ask questions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3fc661a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup index query engine using LLM\n",
    "query_engine = index.as_query_engine()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d6fe0b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test out a query in natural\n",
    "response = query_engine.query(\"Tell me about Sweden's population?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e402baa",
   "metadata": {},
   "outputs": [],
   "source": [
    "response.metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23748347",
   "metadata": {},
   "outputs": [],
   "source": [
    "response.response"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
